1. Examining Continuous Variables

1.1 Introduction

A continuous variable in theory can take any value over its range. A handy exemple of countinuous variable can be:

  • age: you can’t count “age” because it would literally take “forever”. you could be: 25 years, 10 months, 2 days, 5 hours, 4 seconds, 4 milliseconds, 8 nanoseconds and so on..
  • time: it will also take “forever” to count the time; how many nanoseconds since we start reading this line?
  • In practice data for continuous variables are generally rounded to some level of measurement accuracy.

Different plotting methods are used to display the data distribution for continuous variables.

The purpose of those methods is to highlight important features of data. Some plots emphasise one feature over another, some are very specialised, some require more deciphering than others.

Two possible approaches are to use a range of different plot types or to use a variety of different plots of the same type.

The plots types we are going to use in this chapter are histograms and boxplots.

1.2. What features might continuous variable have?

Let’s see a list of features for continuous variables:

  • Assymetry: the distribution is skewed to the left or right;
  • Outliers: one or more values that are far from the rest of the data.
  • Multimodality: the distribution has more than one peak;
  • Gaps: there are ranges of values within the data where no cases are recorded;
  • Heaping: some values occur unexpectedly often;
  • Rounding: only certain values are found because of the rounding process;
  • Impossibilities: values outside the feasible range; you can’t be -1 year old;
  • Errors: values that look wrong for one reason or another

Graphics are good for displaying the features**

They can provide more and different kinds of information than a set of summary statistics.

1.3. Looking for features

Next we are going to analyse a data set using histograms in order to see which features might be present and how we can find them.

The Galton’s Height Data is a famous 1885 study of Francis Galton exploring the relationship between the heights of adult children and the heights of their parents.

The galton dataset from UsingR package includes data on heights for 928 children and 205 ‘midparents’. Using the code below we are creating Children histogram and Parents histogram.

data(galton, package="UsingR")
ht <- "height (in)"
par(mfrow=c(1,2), las=1, mar=c(3.1, 4.1, 1.1, 2.1))
with(galton, {
hist(child, xlab=ht, main="Children", col="green")
hist(parent, xlab=ht, main="Parents", col="blue")})

First things first, we can see that:

  • the distributions are roughly symmetric and there is more spread amongst the children;
  • there appear to be no outliers;
  • although the histograms appear to have different binwidths they are actually the same;

To look for gaps or heaping we create next histograms:

par(mfrow=c(1,2), las=1, mar=c(3.1, 4.1, 1.1, 2.1))
with(galton, {
MASS::truehist(child, h=0.1)
MASS::truehist(parent, h=0.1)})

In both histograms there appear to be narrower gaps between values at the boundaries. Clearly the data were aggregated or reported to a limited level of precision because we can see how few height values actually occur.

Next figure displays the histograms one above the other with equal scales and binwidths. The x-axis scale limits were chosen to include the full range of the data and the y-axis limits were chosen by inspection.

library(ggplot2)
library(gridExtra)
c1 <- ggplot(galton, aes(child)) + geom_bar( binwidth=1) +
xlim(60, 75) + ylim(0, 225) + ylab("") +
geom_vline(xintercept=median(galton$child),
col="red")
p1 <- ggplot(galton, aes(parent)) + geom_bar( binwidth=1) +
xlim(60, 75) + ylim(0, 225) + ylab("") +
geom_vline(xintercept=median(galton$parent),
col="red")
grid.arrange(c1, p1)

We can see that Parents’ heights are spread less and their median is slightly higher. Although the means are very similar, the parents’ distribution is clearly less variable. This is because each parent value is an average of two values.

1.4. Comparing distributions by subgroups

A population can be made up of several groups. To get more informations, we can compare the distributions of variable across the groups.

In order to approach this problem, we can use Boxplots because they make such an efficient use of the space available.

Boxplots:

The boxplot() function takes in any number of numeric vectors, drawing a boxplot for each vector. We will use airquality dataset in order to see the air quality for each month; In the example, Temp can be our numeric vector. Month can be our grouping variable, so that we get the boxplot for each month separately. In our dataset, month is in the form of number 5=May, 6=June and so on).

boxplot(Temp~Month,
data=airquality,
main="Different boxplots for each month",
xlab="Month Number",
ylab="Degree Fahrenheit",
col="orange",
border="brown"
)

It is clear from the above figure that the July month (7) is relatively hotter than the rest. Although the boxplots convey a lot of information in a limited space, they have two disadvantages: If a distribution is not unimodal, you can’t see that, and it is difficult to convey information on how big the different groups are.

1.5. What plots are there for individual continuous variables?

To display continuous data graphically we can use:

  • histogram is used for grouping data into intervals, and drawing a bar for each interval, shows the empirical distribution,
  • boxplot for displaying individual outliers and robust statistics for the data, useful for identifying outliers and for comparisons of distributions across subgroups,
  • dotplot plotting each point individually as a dot, good for spotting gaps in the data,
  • rugplot plotting each point individually as a line, often used as an additional plot along the horizontal axis of another display,
  • density estimate plotting an estimated density of the variable’s distribution, so more like a model than a data display,
  • distribution estimate showing the estimated distribution function, useful for comparing distributions, if one is always ‘ahead’ of another,
  • Q-Q plot comparing the distribution to a theoretical one, normal distribution.

1.6. Plot options

What are the options for plottings?

  • Binwidths: The width of the bins. Can be specified as a numeric value or as a function that calculates width from unscaled x. Exploring multiple widths, we can find the best to ilustrate the stories in our data. This option is crucial for plotting histograms.
  • Unequal binwidths: While the idea of using unequal binwidths is attractive, it is awkward to apply in practice because the it displays confusing and hard to interpret data. If you still want to do it, consider a variable transformation instead.
  • Banwidth for density estimates: The bandwidth selector is a function of four arguments: The data x, a kernel string kernel, a start string start, and a support vector support. To obtain the functions associated with these strings, use get_kernel and get_start. The function should return a double. Crucial while displaying density estimates.

1.7. Modelling and testing for continuous variables

Let’s see the way of modelling and testing for continuous variable:

  • Means is the most common test for continuous data is to test the mean in some way, either against a standard value, or in comparison to the means of other variables, or by subsets. Mostly a t-test is used. Alternatively medians may be tested, especially in conjunction with using boxplots.
  • Symmetry. Its role is to improve the power of the test, using bootstrapping.
  • Normality tests like Anderson-Darling, Shapiro-Wilk, Kolmogorov-Smirnov have low power for small samples and may be rather too powerful for really large samples. A large sample will tend to have some feature that will lead to rejection of the null hypothesis.
  • Density estimation in R can be applied using one of the many R packages wich offer some for of density estimation. This model is fitting the density estimates without testing. Remember that densities for variables with strict boundaries need special treatment at the boundaries.
  • Outliers How useful a outlier test might be depends on the particular application. Testing for outliers we need to prevent both masking: one group of outliers prevents you recognising another and swamping: mistaking standard observations for outliers.
  • Multimodality test can be approach by the dip test. It was proposed in Hartigan and Hartigan, 1985 and it is available in R in the appropriately named package diptest.

1.8. Main points of chapter

  • There are lots of different features that can arise in the frequency distributions of single continuous variables.
  • There is no optimal type of plot and no optimal version of a plot type. Look at several different plots and several different versions of each.
  • Natural binwidths based on context are usually a good choice for histograms.
  • Histograms are good for emphasising features of the raw data, while density estimates are better for suggesting underlying models for the data.
  • Boxplots are best for identifying outliers and for comparing distributions across subgroups.

2. Looking for Structure: Dependency Relationships and Associations

2.1. Introduction

Scatterplots can reveal structure that is not readily apparent from summary statistics or from models. They are the basis of the Gapminder displays, used to draw attention to patterns of world development.

Roles of scatterplots are:
  • revealing associations between variables;
  • identifying outliers;
  • spotting distributional features.

Over ten thousand athletes competed in the London Summer Olympics of 2012. The figure below shows a scatterplot of Weight against Height from the dataset oly12 in the package VGAMdata.

In the following scatterplot we can see that weight increases with height and there are several outliers which affect the scales.

library(ggplot2)

data(oly12, package = "VGAMdata")
ggplot(oly12, aes(Height, Weight)) + geom_point() +
  ggtitle("Athletes at the London Olympics 2012")

Also, giving the large number of points, there is also a lot of overplotting, and most of the points in the middle of the plot represent more that one case (there are 57 athletes who are 1.7m height and weigh 60kg).


2.2. Features you can see with scatterplots

Features that might be found in a scatterplot include:

  • casual relationships (linear and nonlinear) - one variable may have a direct influence on another; the dependent variable should be put on the vertical axis);
  • associations;
  • outliers or groups of outliers;
  • clusters (groups of cases which are separate from the rest of the data);
  • gaps (particular combinations of values that do not occur);
  • barriers (some combinations of values that may not be possible);
  • conditional relationship (the relationship between two variables is better summarised by a conditional description than by a function).


2.3. Looking at pairs of continuous variables

Drinks Wages

A scatterplot of average wages against proportion of drinkers for all 70 trades in Pearson’s DrinksWages dataset in the HistData package shows that there are a surprising number of trades where either all workers are drinkers or none are.

data(DrinksWages, package = "HistData")
ggplot(DrinksWages, aes(drinks / n, wage)) + geom_point() +
  xlab("Proportion of drinkers") + xlim(0, 1) + ylim(0, 40)

Analyzing the distribution of numbers in the study by trade, we observe that over a third have only 1 or 2 members in the survey, which affects the graphic.
After excluding the smaller trades, with less than five members, there is still no particular pattern and no evidence of drink being associated with lower average wages.


Old faithful geyser

The dataset geyser in MASS provides 299 observations of the duration of eruptions and the time to the next eruption for the Old Faithful geyser in Yellowstone Park.

library(ggplot2)

data(geyser, package = "MASS")

ggplot(geyser, aes(duration, waiting)) + geom_point()

A short duration implies a long waiting time until the next eruption, while a long duration can imply a short or long waiting time. There may be 3 clusters and possibly a couple of outlying values, but there is also a suggestion of rounded values for the eruption durations (note the numbers of durations of 2 and 4 minutes). To assess the possibility of clustering, consider a density estimate. The figure below displays contours of a bivariate density estimate supporting the idea of there being three clusters.
ggplot(geyser, aes(duration, waiting)) + geom_point() +
  geom_density2d()

There is evidence of:
  • three concentrations of data;
  • two univariate outliers (one eruption with low duration and one with a high waiting time until the next eruption);
  • one bivariate outlier.


Movie ratings

The distribution of movie lengths in the dataset movies from ggplot2, discussed previously, will be displayed now with a focus on two other variables, rating (average IMDB user rating) and votes (number of people who rated the movie).

library(ggplot2)

data(movies, package = "ggplot2movies")
ggplot(movies, aes(votes, rating)) + geom_point() + ylim(1, 10)

Films rated very often have higher average ratings, while the highest ratings are achieved by films that are rated far less often.
There are many insights that can be gained from the plot above:
  • There are no films with lots of votes and a low average rating.
  • For films with more than a small number of votes, the average rating increases with the number of votes
  • No film with lots of votes has an average rating close to the maximum possible.
  • A few films with a high number of votes, over 50,000, look like outliers. They have a distinctly lower rating than other films with similar numbers of votes.
  • Films with a low number of votes may have any average rating from the lowest to the highest.
  • The only films with very high average ratings are films with relatively few votes.


2.4. Adding models: lines and smooths

Pearson heights

Adding lines or smooths to scatterplots is easy and often provides valuable guidance, as can be seen comparing the figures below, where the relationship between sons’ and fathers’ heights, from dataset father.son, is difficult to assess from the scatterplot alone.

The best fit regression line has a slope of just over 0.5, as can be seen by comparison with the line y = x. The height of a man is influenced by the height of his father, but there is a lot of unexplained variability.
data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
geom_smooth(method="lm", colour="red") +
geom_abline(slope=1, intercept=0)
## `geom_smooth()` using formula 'y ~ x'

To explore further whether a non-linear model might be warranted, we can fit a smoother or plot a smoother and the best fit regression line together.
library(ggplot2)

data(father.son, package = "UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
  geom_smooth(method = "lm",
              colour = "red",
              se = FALSE) +
  stat_smooth()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Adding a line or a smooth (or both) makes modelling explicit and confirms the subjective impressions we may have, by checking with other views of the data, from a more objective point of view.


2.5. Comparing groups within scatterplots

The Olympic athletes’ dataset shown in Chapter 2.1 suffered from overplotting, which can be overcome by splitting up the data by possible explanatory variables. In this case, we can separate the plot by sex: a scatterplot for females and one for males.

ggplot(oly12, aes(Height, Weight)) +
  geom_point(size = 1) + facet_wrap( ~ Sex, ncol = 1)

Plotting all the 42 sports from the dataset together we get the following figure:

oly12S <- within(oly12, Sport <- abbreviate(Sport, 12))
ggplot(oly12S, aes(Height, Weight)) +
  geom_point(size = 1) + facet_wrap(~ Sport) +
  ggtitle("Weight and Height by Sport")

There can be made a set of observations based on the scatterplot above:
  • data is missing for some sports;
  • the association between weight and height looks linear for most sports;
  • with judo and wrestling, for example, the relationship is weaker, possibly because of the range of different weight classes in these sports.


2.6. Scatterplot matrices for looking at many pairs of variables

Scatterplot matrices (sploms) are tables of scatterplots with each variable plotted against all of the others. They give excellent initial overviews of the relationships between continuous variables in datasets with small numbers of variables.

Crime in the U.S.

The dataset crime.us from the VGAMdata package includes the absolute crime figures and the crime rates by population for the fifty U.S. states in 2009.
The figure below shows a splom of the rates for seven kinds of crime. The dataset also includes rates for the four crimes of violence together and density estimates for the variables down the diagonal.
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
data(crime.us, package = "VGAMdata")
crime.usR <- crime.us
names(crime.usR) <- gsub("*Rate", "", names(crime.usR))
names(crime.usR)[19:20] <- c("Larceny", "MotorVTheft")

ggpairs(
  crime.usR[, c(13:16, 18:20)],
  title = "Crime rates in the USA",
  diag = list(continuous = 'densityDiag'),
  axisLabels = 'none'
)

We observe that some rates are positively associated, others not. Also, the crime of rape is least associated with the others.


Swiss banknotes

The dataset bank from the gclus package includes six measurements on each of 100 genuine Swiss banknotes and 100 forged notes.
library(car)
## Loading required package: carData
data(bank, package = "gclus")
par(mar = c(1.1, 1.1, 1.1, 1.1))
spm(
  dplyr::select(bank, Length:Diagonal),
  pch = c(16, 16),
  diagonal = "histogram",
  smoother = FALSE,
  reg.line = FALSE,
  groups = bank$Status
)

The above splom has forged notes coloured in pink. In some of the scatterplots the groups of notes are well separated. Some variables are associated, some not. There are a few possible outliers, not all of them forgeries.


Functions for drawing sploms

function package
plot
pairs
spm car
cpairs gclus
splom lattice
ggpairs GGally
gpairs gpairs
pairs.mod SMPracticals
pairscor.fnc languageR

Note: Sploms are ineffective for large number of variables (a splom for m variables includes m(m-1) scatterplots and m diagonal elements). However, they can be useful for giving a quick overview of a few variables.


2.7. Scatterplot options

Point size

  • larger (for emphasising outliers);
  • smaller (for distinguishing groups of points);
  • to represent the value of a continous non-negative variable.

Symbols for points

  • if the points do not overlap and are easy to identify, we should draw the members of each group with a different symbol to tell them apart;
  • with large datasets, we should use a set of scatterplots for each group.

Alpha-blending

  • each point is given a weight between 0 and 1;
  • where several points overlap, the resulting area is drawn correspondingly darker (or more opaque).

Colouring points

  • used to distinguish points by groups;
  • if there are points on top of one another, the visible colour is the colour of the last point drawn.


2.8. Modelling and testing for relationships between variables

Correlation

Correlation coefficients measure linear association, but to learn more about what they might mean we should also draw a scatterplot. Likewise, if we draw a scatterplot and we see a linear association, we should calculate the previously mentioned coefficient to measure exactly the level of correlation.

Regression

Given a presumed causal relationship between the y-variable and the x-variable of a scatterplot, regression may be used to fit a model. It can be helpful to overlay the model on a scatterplot of the data, to add confidence bands for the fit, or to add predictive intervals for possible new points.

Smoothing

If no analytic model is proposed for Y as a function of X, then a nonlinear smoother can be tried, for example loess (local weighted regression).

Bivariate density estimation

kde2d from MASS, kde from ks, bkde2d from KernSmooth or the highest density region package hdrcde may be used.

Outliers

Points which are outliers in scatterplots may be outliers on one of the two dimensions individually or purely bivariate outliers, which can be found out by using density estimators.

2.9. Main points

  • Scatterplots can take many different forms and provide a lot of information about the relationship between two variables.
  • Adding lines or smooths to scatterplots is easy and often provides valuable guidance, as can be seen in the plot from 2.4, where the relationship between sons’ and fathers’ heights is difficult to assess from the scatterplot alone.
  • Trellis displays are very effective for comparing scatterplots by subgroup, especially when the groups overlap, like in the case of the Olympic games scatterplot.
  • Sploms are excellent for giving a quick overview of a few variables. For example, the plot from 5.6 summarises the U.S. crime rates dataset well.


3. Investigating Multivariate Continuous Data

3.1. Introduction

Traditional methods used for displaying multivariate continous data:

  • scatterplots. Disadvantage: only for bivariate data, not effective for exploring in higher dimensions;
  • matrices of scatterplots. Disadvantage: do not really convey what multivariate structure there might be in the data;
  • dimension reduction methods (PCA, MDS, etc). Disadvantage: they approximate the data, and it is difficult to assess how good the approximations are.

New Method: parallel coordinate plots.

3.2. What is a parallel coordinate plot (pcp)?

Down below we can observe a parallel coordinate plot of the six variables in the food dataset divided by the variable weight.grams, the weight of a serving.

library('GGally'); data(food, package="MMST")
names(food) <- c("Fat", "Food.energy", "Carbohyd", "Protein", "Cholest", "Wt", "Satur.Fat")
food1 <- food/food$Wt
ggparcoord(data = food1, columns=c(1:5, 7), scale="uniminmax", alphaLines=0.2) + 
  xlab("") + ylab("")

In a parallel coordinate plot:

  • all axes are parallel to one another;
  • each variable has its own individual vertical axis (or alternatively all the axes are horizontal);
  • each axis is usually scaled from the minimum to the maximum case values for the variable;
  • each polygonal line across all axes defines a profile.


Functions for drawing pcp’s in R

function package
ggparcoord GGally
parcoord MASS
cparcoord gclus
pcp PairViz
parallelplot lattice
parallel.ade epade
ipcp iplots

3.3. Features you can see with parallel coordinate plots

Parallel coordinate plots are very useful for checking results like:

  • outliers
  • groups of cases with very similar values (clusters)
  • bivariate associations between adjacent variables (correlations)
  • gaps or concentrations of data
  • skewness of univariate distributions for several variables at once


Illustrating Fisher’s iris dataset

In a pcp of the data in the figure down below we can see how clearly the setosa species is separated from the other two species on the petal measurements, how it is lower on three of the measurements, but higher on Sepal.Width, where there is also a setosa outlier, how the two petal measurements almost separate the other two species, and how additionally there are two unusual virginica values on Sepal.Width

ggparcoord(iris, columns=1:4, groupColumn="Species")

3.4. Interpreting clustering results

We can compare the clusters graphically with a pcp in which the clusters are given different colours.

In the USArrests dataset from 1973, the Assault variable separates the clusters, and the reason becomes obvious if we draw the same plot with the option scale=“globalminmax”, which puts all the variables on a common scale: Assault has a much higher range of values than the other three variables, so it dominates the multivariate distance calculations.

hcav <- hclust(dist(USArrests), method="ave")
clu3 <- cutree(hcav, k=3)
clus <- factor(clu3)
usa1 <- cbind(USArrests, clus)
ggparcoord(usa1, columns=1:4, groupColumn="clus", scale="uniminmax") + 
  xlab("") + ylab("") + 
  theme(legend.position = "none") +
  geom_line(size = 1.1)

ggparcoord(usa1, columns=1:4, groupColumn="clus", scale="globalminmax") + 
  xlab("") + ylab("") + 
  theme(legend.position = "none") +
  geom_line(size = 1.1)

3.5. Parallel coordinate plots and time series

In a parallel coordinate plot used for representing time series:

  • each axis represents one of the time points;
  • we cannot change the order of the axis;
  • a common scaling is used for all axes (scale=globalminmax).

Note: ipcp method of iplots package is better for creating interactive time series pcp’s.


Illustrating the acres of corn planted by US states from 1866 to 2011

Down below we can observe a pcp of the acres of corn planted by US states from 1866 to 2011 from the nass.corn dataset in the agridat package. Only the states with more than 250,000 acres planted in 2011 and no missing values are included.

The sharp rise for five states in the 19th century, the dips at the time of the Depression in the 1930s and in the early 1980s, and various unusual individual patterns are all visible in this pcp.

library(reshape2); data(nass.corn, package="agridat")
c1 <- melt(nass.corn, id=c("year", "state"))
c1 <- within(c1, StateV <- interaction(state, variable))
c2 <- dcast(c1, StateV~year)
ggparcoord(subset(c2[1:48,], c2[1:48,147]> 250000), columns=2:147, groupColumn="StateV", scale="globalminmax") + 
  xlab("Year") + ylab("Acres") + 
  scale_x_discrete(breaks=seq(1865, 2015, 10)) + 
  theme(legend.position = "none")

3.6. Parallel coordinate plots for indices

Usually, index values are weighted combinations of the values of their components. Examples:

  • stock market indices - used to represent performances of groups of shares;
  • consumer price index - summarises prices across a wide range of products;
  • average teaching score - an overall index used for ranking universities in UK.

One way to represent them in pcp’s:

  • the index itself will have one axes associated;
  • for each of the components there will be a seperate axes;
  • the polygonal lines are the profiles of the individuals;
  • more effective if some individuals or groups are identified by colour.


Illustrating the Russell group from the uniranks dataset

Here we can see a pcp of the uniranks dataset from the GDAdata package. 120 universities in the UK were ranked using a combination of eight criteria. The AverageTeachingScore is the overall index and the criteria are the components. The Russell group is made up of 24 of the top UK universities. In this figure they have been coloured red to highlight them.

data(uniranks, package="GDAdata")
names(uniranks)[c(5, 6, 8, 10, 11, 13)] <- c("AvTeach", "NSSTeach", "SpendperSt", "Careers", "VAddScore", "NSSFeedb")
uniranks1 <- within(uniranks, StaffStu <- 1/(StudentStaffRatio))
uniranks2 <- within(uniranks1,
                    Rus <- ifelse(UniGroup=="Russell", "Russell", "not"))
ggparcoord(uniranks2[order(uniranks2$Rus, decreasing=FALSE),],
           columns=c(5:8, 10:14),
           order=c(5,12,8,9,14,6,13,7,11,10),
           groupColumn="Rus", scale="uniminmax") +
           xlab("") + ylab("") +
           theme(legend.position = "none",
           axis.ticks.y = element_blank(),
           axis.text.y = element_blank()) +
           scale_colour_manual(values = c("gray","red")) + 
           geom_line(size = 0.8)

3.7. Options for parallel coordinate plots

3.7.1. Alignment

Instead of aligning the variables by their minimum and maximum values, they can be aligned by their mean or median with a corresponding adjustment to the minimum and maximum limits. This permits comparing the variability across variables better. The nass.corn dataset from 3.5. Parallel coordinate plots and time series has been used to illustrate this kind of alignment by mean.
mz <- as.data.frame(apply(c2[1:48,2:147], 2,
                          function(x) x - mean(x,na.rm=TRUE)))
StateV <- c2[1:48,1]
mzA <- as.data.frame(cbind(StateV, mz))
ggparcoord(mzA, columns=2:147, scale="globalminmax", groupColumn="StateV") +
  xlab("Year") + ylab("Acres") +
  scale_x_discrete(breaks=seq(1865, 2015, 10)) +
  theme(legend.position = "none")

3.7.2. Scaling

Depending on the situation, there are different types of scaling:
  • standardise each variable by substracting the mean and divinding by the standard deviation (scale=std)
  • use the respective minimum and maximum values for each variable (scale=uniminmax)
  • use a common scale so that the data will not be altered (scale=globalminmax)
  • standardize vertical height and then center each varialbe around the median (scale=center, scaleSummary=median)
Here we have used the 24 measurements for the 507 men and women in the body dataset, each pcp having a different type of scaling.
library('gridExtra'); data(body, package="gclus")
body1 <- body
names(body1) <- abbreviate(names(body), 2)
names(body1)[c(4:5, 11:13, 19:21)] <- c("CDp", "CD", "Ch", "Ws", "Ab", "Cl", "An", "Wr")
a1 <- ggparcoord(body1, columns=1:24, alphaLines=0.1) + xlab("") + ylab("")
a2 <- ggparcoord(body1, columns=1:24, scale="uniminmax", alphaLines=0.1) + xlab("") + ylab("")
a3 <- ggparcoord(body1, columns=1:24, scale="globalminmax", alphaLines=0.1) + xlab("") + ylab("")
a4 <- ggparcoord(body1, columns=1:24, scale="center", scaleSummary="median", alphaLines=0.1) + xlab("") + ylab("")
grid.arrange(a1, a2, a3, a4)

3.7.3. Outliers

Sometimes we want to redraw the plot without showing the outliers. There are 3 ways:
  • remove the cases with any outliers
  • trim all outlier values to the chosen limits
  • restrict the plot to the chosen limits
Remove the cases with outliers (food dataset)
fc <- function(xv) {
  bu <- boxplot(xv, plot=FALSE)$stats[5]
  cxv <- ifelse(xv > bu, NA, xv)
  bl <- boxplot(xv, plot=FALSE)$stats[1]
  cxv <- ifelse(cxv < bl, NA, cxv)
}
data(food, package="MMST")
rxfood <- as.data.frame(apply(food,2,fc))
ggparcoord(data = rxfood, columns = c(1:7), 
           scale="uniminmax", missing="exclude", 
           alphaLines=0.3) + xlab("") + ylab("")

Trim all outliers to the chosen limits (food dataset)
fb <- function(xv) {
  bu <- boxplot(xv, plot=FALSE)$stats[5]
  rxv <- ifelse(xv > bu, bu, xv)
  bl <- boxplot(xv, plot=FALSE)$stats[1]
  rxv <- ifelse(rxv < bl, bl, rxv)
}
data(food, package="MMST")
rfood <- as.data.frame(apply(food,2,fb))
ggparcoord(data = rfood, columns = c(1:7), 
           scale="uniminmax", alphaLines=0.3)

Restrict the plot to the chosen limits (food dataset)
fd <- function(xv) {
  bu <- boxplot(xv, plot=FALSE)$stats[5]
  bl <- boxplot(xv, plot=FALSE)$stats[1]
  dxv <- (xv - bl)/(bu - bl)
}
data(food, package="MMST")
rofood <- as.data.frame(apply(food,2,fd))
ggparcoord(data = rofood, columns = c(1:7)) + 
           coord_cartesian(ylim=c(0,1))

3.7.4. Variable order

Order the axes by the F statistics from analyses of variance based on a grouping variable (as illustrated in body dataset)
# data(body, package="gclus")
# body1 <- body
body1$Gn <- factor(body1$Gn)
ggparcoord(body1, columns=1:24, scale="uniminmax", 
           alphaLines=0.4, groupColumn="Gn",
           order="allClass") + xlab("") + ylab("") + 
           theme(legend.position = "none",
           axis.ticks.y = element_blank(),
           axis.text.y = element_blank())

  • we can display two graphics, one for women and one for men, to avoid overplotting for the variables where the genders overlap.
a <- ggparcoord(body1[order(body1$Gn),], columns=c(1:24),
                groupColumn="Gn", order="allClass",
                scale="uniminmax") + xlab("") + ylab("") +
                theme(legend.position = "none",
                axis.ticks.y = element_blank(),
                axis.text.y = element_blank()) +
                scale_colour_manual(values = c("grey","#00BFC4"))
b <- ggparcoord(body1[order(body1$Gn, decreasing=TRUE),],
                columns=c(1:24), groupColumn="Gn", order="allClass",
                scale="uniminmax") + xlab("") + ylab("") +
                theme(legend.position = "none",
                axis.ticks.y = element_blank(),
                axis.text.y = element_blank()) +
                scale_colour_manual(values = c("#F8766D","grey"))
grid.arrange(a,b)

Order the axes by their median values (body dataset)
library('dplyr')
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
m2 <- apply(body[, 1:24], 2, median, na.rm=TRUE)
m2a <- order(m2)
ggparcoord(data = select(body, -Gender), alphaLines=0.3,
           scale="globalminmax", order=m2a) + coord_flip()

Order the axes by their maximum values after standardization by mean and standard deviation (body dataset)
B1 <- ggparcoord(data = body1, columns=c(1:24), scale="std")
B2 <- acast(B1$data[ ,c(2,4,5)], .ID ~ variable)
m4 <- apply(B2, 2, max, na.rm=TRUE)
m4r <- order(m4)
ggparcoord(data = body1, alphaLines=0.3,
           columns=c(1:24), scale="std", order=m4r)

3.7.5. Formatting

Type of display
  • by default cases are represented as polygonal lines;
  • display individual points (showPoints=TRUE);
  • plot boxplots for each variable (boxplot=TRUE).
Missings
  • by default exclude cases with any missings;
  • impute missing values.
Aspect ratio
  • pcp’s are best drawn wide and moderately high;
  • usually the appropriate window size depends on the data.
Orientation
  • better to draw pcp’s horizontally rather than vertically so that we can write variable names horizontally one above the other, avoiding overlapping texts.
Lines
  • their thickness can be varied;
  • given the large number of lines, they should always be very thin;
  • line colour and alpha-blending need more attention.
Colour
  • lines can be coloured in groups by a factor variable or on a continuous shading scale by a numeric variable;
data(Boston, package="MASS")
Boston1 <- within(Boston,
                  hmedv <- factor(ifelse(medv == 50,"Top", "Rest")))
Boston1 <- within(Boston1, mlevel <- ifelse(medv==50,1,0.1))
Boston1 <- within(Boston1, medv1 <- medv)
a <- ggparcoord(data = Boston1[order(Boston1$hmedv),],
                columns=c(1:14), groupColumn="hmedv",
                scale="uniminmax", alphaLines="mlevel",
                mapping = aes(size = 1)) + xlab("") + ylab("") +
                theme(axis.ticks.y = element_blank(),
                axis.text.y = element_blank())
b <- ggparcoord(data = Boston1, columns=c(1:14),
                groupColumn="medv1", scale="uniminmax") +
                xlab("") + ylab("") +
                theme(axis.ticks.y = element_blank(),
                axis.text.y = element_blank())
grid.arrange(a,b)

Alpha-blending

The option alphaLines in ggparcoord can be used:

  • to make some groups stand out more than others;
  • to apply a user-specified variable in which the level of alpha-blending can be set for each case individually.
Boston1 <- Boston1 %>% mutate(
           arad = factor(ifelse(rad < max(rad), 0, 1)),
           aLevel = ifelse(rad < max(rad), 0.1, 1))
ggparcoord(data = Boston1, columns=c(1:14),
           scale="uniminmax", groupColumn= "arad",
           alphaLines="aLevel", order="allClass") +
           xlab("") + ylab("") +
           theme(legend.position = "none",
           axis.ticks.y = element_blank(),
           axis.text.y = element_blank())

3.8. Main points

  • Parallel coordinate plots are a powerful multivariate display, showing all continuous variables at once;

  • There are many pcp display options in addition to the usual formatting options applicable to other graphics;

  • Pcp’s are helpful for exploring and evaluating the results of analyses such as discriminant or cluster analyses. They are also useful for checking multivariate outliers;

  • Pcp’s are an alternative means for presenting multiple regularly spaced time series;

  • Pcp’s are informative for displaying indices and their component parts together

  • We should draw several pcp’s to uncover the different features in a dataset (e.g. food dataset)

THANK YOU!